Code
library(tidyverse)
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
dplyr
is a core tidyverse package that provides a consistent and intuitive grammar for data manipulation.
First, we load the tidyverse
suite of packages.
We will use the built-in mtcars
dataset for these examples. For simplicity, we select a few columns and convert the row names to a proper column called car_name
.
car_name cyl mpg hp
1 Mazda RX4 6 21.0 110
2 Mazda RX4 Wag 6 21.0 110
3 Datsun 710 4 22.8 93
4 Hornet 4 Drive 6 21.4 110
5 Hornet Sportabout 8 18.7 175
6 Valiant 6 18.1 105
The names()
function returns a character vector of the column names in a data frame.
The select()
verb chooses columns by name.
select()
can be used with helper functions like contains()
to choose columns based on patterns in their names.
You can also select columns by their position.
Use the -
sign to deselect or drop columns.
The rename()
verb changes the name of a column.
mutate()
adds new columns or transforms existing ones.
if_else
if_else()
is a vectorized conditional statement, useful inside mutate()
.
case_when
case_when()
is useful for multiple conditions (like a multi-part if-else statement).
car_name cyl mpg hp cly_group
1 Mazda RX4 6 21.0 110 big
2 Mazda RX4 Wag 6 21.0 110 big
3 Datsun 710 4 22.8 93 other
4 Hornet 4 Drive 6 21.4 110 big
5 Hornet Sportabout 8 18.7 175 very big
6 Valiant 6 18.1 105 big
transmute()
is like mutate()
, but it drops all other columns, keeping only the newly created ones.
filter()
subsets rows based on logical conditions.
car_name cyl mpg hp
1 Mazda RX4 6 21.0 110
2 Mazda RX4 Wag 6 21.0 110
3 Hornet 4 Drive 6 21.4 110
4 Hornet Sportabout 8 18.7 175
5 Valiant 6 18.1 105
Separate conditions with a comma for an AND relationship.
Use the |
operator for an OR relationship.
slice
slice()
allows you to select rows by their position.
Select the 5th row:
Select a random sample of rows:
Let’s create a sample data frame with missing values (NA
).
is.na()
returns a logical vector indicating which values are missing.
You can use filter()
with !is.na()
to remove rows with missing values in a specific column.
tidyr::drop_na()
removes rows with any missing values.
group_by()
groups the data by one or more variables. summarise()
then collapses each group into a single-row summary.
Calculate the mean, min, max, and sum of mpg
for each cylinder group.
n()
counts the total number of records in each group, while n_distinct()
counts the number of unique values.
Window functions are functions that operate on a “window” of data (e.g., within a group) but, unlike summarise()
, they return a value for every row.
row_number
row_number()
assigns a unique rank to each row within a group.
lag
and lead
lag()
and lead()
are useful for comparing a value to its predecessor or successor.
cumsum()
calculates the cumulative sum.
arrange()
sorts the rows of a data frame by one or more columns.
car_name cyl mpg hp
1 Datsun 710 4 22.8 93
2 Mazda RX4 6 21.0 110
3 Mazda RX4 Wag 6 21.0 110
4 Hornet 4 Drive 6 21.4 110
5 Valiant 6 18.1 105
6 Hornet Sportabout 8 18.7 175
Sort in descending order using desc()
.
First, let’s create two small data frames to demonstrate joins.
inner_join()
returns only the rows where the key exists in both tables.
left_join()
returns all rows from the left table, and matching rows from the right table.
anti_join()
returns all rows from the left table that do not have a match in the right table.
Reshaping data involves converting it between “wide” and “long” formats.
pivot_longer
pivot_longer()
makes data “longer” by gathering multiple columns into key-value pairs.
# A tibble: 12 × 4
subject sex condition measurement
<int> <chr> <chr> <dbl>
1 1 M control 7.9
2 1 M cond1 12.3
3 1 M cond2 10.7
4 2 F control 6.3
5 2 F cond1 10.6
6 2 F cond2 11.1
7 3 F control 9.5
8 3 F cond1 13.1
9 3 F cond2 13.8
10 4 M control 11.5
11 4 M cond1 13.4
12 4 M cond2 12.9
pivot_wider
pivot_wider()
does the opposite, making data “wider” by spreading a key-value pair into multiple columns.
stringr
provides a consistent and user-friendly interface for common string operations.
str_detect()
checks for the presence of a pattern and returns a logical vector.
[1] TRUE FALSE TRUE TRUE FALSE
str_replace_all()
replaces all matches of a pattern with a new string.
tidyr::separate()
splits a column into multiple new columns based on a delimiter.
str_extract()
pulls out the first match of a regular expression.
lubridate
simplifies working with dates and times in R.
lubridate
provides helper functions like ymd()
to parse strings into dates reliably.
Easily extract components like year, month, or day from a date object.
Perform calculations with date objects.
car_name cyl mpg hp
[1,] 4 6 21.0 110
[2,] 5 6 21.0 110
[3,] 1 4 22.8 93
[4,] 2 6 21.4 110
[5,] 3 8 18.7 175
[6,] 6 6 18.1 105
$car_name
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
$cyl
[1] 6 6 4 6 8 6
$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1
$hp
[1] 110 110 93 110 175 105
https://dplyr.tidyverse.org/
https://stringr.tidyverse.org/
https://lubridate.tidyverse.org/
---
title: "Data Manipulation with tidyverse"
execute:
warning: false
error: false
format:
html:
toc: true
toc-location: right
code-fold: show
code-tools: true
number-sections: true
code-block-bg: true
code-block-border-left: "#31BAE9"
---
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
{width="600"}
`dplyr` is a core tidyverse package that provides a consistent and intuitive grammar for data manipulation.
{width="231"}
# Load Packages and Data
First, we load the `tidyverse` suite of packages.
```{r}
library(tidyverse)
```
We will use the built-in `mtcars` dataset for these examples. For simplicity, we select a few columns and convert the row names to a proper column called `car_name`.
```{r}
data(mtcars)
small_mtcars <- mtcars %>%
select(cyl, mpg, hp) %>%
head()
small_mtcars <- rownames_to_column(small_mtcars, var = "car_name")
small_mtcars
```
# Data Manipulation with dplyr
## Select Columns
### Get Column Names
The `names()` function returns a character vector of the column names in a data frame.
```{r}
names(small_mtcars)
```
### Select by Name
The `select()` verb chooses columns by name.
```{r}
small_mtcars %>% select(cyl, mpg, hp)
```
### Select with Helpers
`select()` can be used with helper functions like `contains()` to choose columns based on patterns in their names.
```{r}
small_mtcars %>% select(contains("p"))
```
### Select by Index
You can also select columns by their position.
```{r}
small_mtcars %>% select(1, 2)
```
### Drop Columns
Use the `-` sign to deselect or drop columns.
```{r}
small_mtcars %>% select(-cyl)
```
## Rename Columns
The `rename()` verb changes the name of a column.
```{r}
small_mtcars %>% rename(new_cyl = cyl)
```
## Create Columns
### Mutate
`mutate()` adds new columns or transforms existing ones.
```{r}
small_mtcars %>% mutate(new_cyl = cyl + 1)
```
### Conditional Creation with `if_else`
`if_else()` is a vectorized conditional statement, useful inside `mutate()`.
```{r}
small_mtcars %>% mutate(new_cly_group = if_else(cyl > 6, 'big', 'small'))
```
### Conditional Creation with `case_when`
`case_when()` is useful for multiple conditions (like a multi-part if-else statement).
```{r}
small_mtcars %>% mutate(cly_group = case_when(
cyl > 6 ~ "very big",
cyl > 4 ~ "big",
TRUE ~ "other"
))
```
### Transmute
`transmute()` is like `mutate()`, but it drops all other columns, keeping only the newly created ones.
```{r}
small_mtcars %>% transmute(new_cyl = cyl + 1)
```
## Filter Rows
`filter()` subsets rows based on logical conditions.
```{r}
small_mtcars %>% filter(cyl > 5)
```
### Filters with AND conditions
Separate conditions with a comma for an AND relationship.
```{r}
small_mtcars %>% filter(cyl > 5, mpg > 20)
```
### Filters with OR conditions
Use the `|` operator for an OR relationship.
```{r}
small_mtcars %>% filter(cyl > 5 | mpg > 20)
```
### Filter Rows by Index with `slice`
`slice()` allows you to select rows by their position.
Select the 5th row:
```{r}
small_mtcars %>% slice(5)
```
Select a random sample of rows:
```{r}
small_mtcars %>% sample_n(3)
```
## Handle Missing Data
Let's create a sample data frame with missing values (`NA`).
```{r}
missing_df <- data.frame(
x = c(1, 2, NA, 4),
y = c("a", NA, "c", "d")
)
```
### Find Missing Data
`is.na()` returns a logical vector indicating which values are missing.
```{r}
is.na(missing_df)
```
### Filter Out Missing Data
You can use `filter()` with `!is.na()` to remove rows with missing values in a specific column.
```{r}
missing_df %>% filter(!is.na(y))
```
`tidyr::drop_na()` removes rows with any missing values.
```{r}
missing_df %>% drop_na()
```
## Group and Summarize Data
`group_by()` groups the data by one or more variables. `summarise()` then collapses each group into a single-row summary.
### Common Summary Functions
Calculate the mean, min, max, and sum of `mpg` for each cylinder group.
```{r}
mtcars %>%
group_by(cyl) %>%
summarise(
avg_mpg = mean(mpg),
min_mpg = min(mpg),
max_mpg = max(mpg),
sum_mpg = sum(mpg)
)
```
### Count Records
`n()` counts the total number of records in each group, while `n_distinct()` counts the number of unique values.
```{r}
mtcars %>%
group_by(cyl) %>%
summarise(
n_mpg = n(),
distinct_n_mpg = n_distinct(mpg)
)
```
## Window Functions
Window functions are functions that operate on a "window" of data (e.g., within a group) but, unlike `summarise()`, they return a value for every row.
### Ranking with `row_number`
`row_number()` assigns a unique rank to each row within a group.
```{r}
small_mtcars %>%
group_by(cyl) %>%
mutate(rank = row_number(desc(mpg))) %>%
select(cyl, mpg, rank)
```
### Accessing Previous/Next Values with `lag` and `lead`
`lag()` and `lead()` are useful for comparing a value to its predecessor or successor.
```{r}
small_mtcars %>%
select(cyl, mpg) %>%
mutate(mpg_previous = lag(mpg, n = 1))
```
### Cumulative Summaries
`cumsum()` calculates the cumulative sum.
```{r}
small_mtcars %>%
select(cyl, mpg) %>%
mutate(mpg_running_total = cumsum(mpg))
```
## Order Rows
`arrange()` sorts the rows of a data frame by one or more columns.
```{r}
small_mtcars %>% arrange(cyl)
```
Sort in descending order using `desc()`.
```{r}
small_mtcars %>% arrange(desc(cyl))
```
## Join Tables
First, let's create two small data frames to demonstrate joins.
```{r}
left_data = small_mtcars %>% slice(1:3)
right_data = small_mtcars %>% slice(2:4)
```
### Inner Join
`inner_join()` returns only the rows where the key exists in both tables.
```{r}
left_data %>% inner_join(right_data, by = "car_name")
```
### Left Join
`left_join()` returns all rows from the left table, and matching rows from the right table.
```{r}
left_data %>% left_join(right_data, by = "car_name", suffix = c("_l", "_r"))
```
### Anti Join
`anti_join()` returns all rows from the left table that do not have a match in the right table.
```{r}
left_data %>% anti_join(right_data, by = "car_name")
```
## Reshape Tables
Reshaping data involves converting it between "wide" and "long" formats.
```{r}
olddata_wide <- read.table(header = TRUE, text = '
subject sex control cond1 cond2
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
3 F 9.5 13.1 13.8
4 M 11.5 13.4 12.9
')
```
### Wide to Long with `pivot_longer`
`pivot_longer()` makes data "longer" by gathering multiple columns into key-value pairs.
```{r}
data_long = olddata_wide %>%
pivot_longer(!c(subject, sex), names_to = 'condition', values_to = 'measurement')
data_long
```
### Long to Wide with `pivot_wider`
`pivot_wider()` does the opposite, making data "wider" by spreading a key-value pair into multiple columns.
```{r}
data_wide = data_long %>%
pivot_wider(names_from = condition, values_from = measurement)
data_wide
```
# String Manipulation with stringr
`stringr` provides a consistent and user-friendly interface for common string operations.

## String Operations
```{r}
x <- "I like horses."
str_length(x)
```
```{r}
str_to_upper(x)
```
## Pattern Matching
`str_detect()` checks for the presence of a pattern and returns a logical vector.
```{r}
word_list = c('abc', 'bbc', 'apple', 'bbaa', 'cc')
word_list %>% str_detect('a')
```
`str_replace_all()` replaces all matches of a pattern with a new string.
```{r}
text001 = "abcb"
text001 %>% str_replace_all('b', '1')
```
## Splitting Strings
`tidyr::separate()` splits a column into multiple new columns based on a delimiter.
```{r}
df_word <- data.frame(word = c('a-b', '1-c', 'c-c'))
df_word %>% separate(word, c('col1', 'col2'), '-')
```
## Extracting Substrings
`str_extract()` pulls out the first match of a regular expression.
```{r}
trx = 'abc1993 ccc'
trx %>% str_extract("(\\d)+")
```
# Date and Time Manipulation with lubridate
`lubridate` simplifies working with dates and times in R.

```{r}
library(lubridate)
```
## Parsing Dates
`lubridate` provides helper functions like `ymd()` to parse strings into dates reliably.
```{r}
date3 = ymd('2023-01-01')
class(date3)
```
## Date Components
Easily extract components like year, month, or day from a date object.
```{r}
now_time = now()
year(now_time)
```
```{r}
month(now_time)
```
```{r}
wday(now_time, label = TRUE)
```
## Date Arithmetic
Perform calculations with date objects.
```{r}
day1 = ymd('2022-01-01')
day2 = ymd('2023-02-03')
# Calculate the time difference
interval(day1, day2) %/% months(1)
```
# dataframe to other data format
## dataframe to vector
```{r}
data=small_mtcars$cyl
data
```
```{r}
class(data)
```
## dataframe to matrix
```{r}
data=data.matrix(small_mtcars)
data
```
```{r}
class(data)
```
## dataframe to list
```{r}
data=as.list(small_mtcars)
data
```
```{r}
class(data)
```
# Reference:
https://dplyr.tidyverse.org/
https://stringr.tidyverse.org/
https://lubridate.tidyverse.org/